Online social networks play a major role in the spread of 
information at very large scale and it becomes essential to 
provide means to analyse this phenomenon. In this paper 
we address the issue of predicting the temporal dynamics of 
the information diffusion process. We develop a graph-based 
approach built on the assumption that the macroscopic dy- 
namics of the spreading process are explained by the topol- 
ogy of the network and the interactions that occur through 
it, between pairs of users, on the basis of properties at the 
microscopic level. We introduce a generic model, called
T-BaSIC, and describe how to estimate its parameters from
users' behaviours using machine learning techniques. Contrary
to classical approaches where the parameters are fixed in
advance, T-BaSIC's parameters are functions of time, which
makes it possible to better approximate and adapt to
the diffusion phenomenon observed in online social networks.
Our proposal has been validated on real Twitter datasets. 
Experiments show that our approach is able to capture the
particular patterns of diffusion depending on the studied
sub-networks of users and topics. The results corroborate the
"two-step" theory (1955) that states that information flows 
from media to a few "opinion leaders" who then transfer it 
to the mass population via social networks and show that 
it applies in the online context. This work also highlights 
interesting recommendations for future investigations. 

1. INTRODUCTION 

The Web 2.0, through the concepts of the "participatory"
and "social" web, allows hundreds of millions of Internet
users worldwide to produce and consume content. Thus,
the Web provides access to a vast source of information
on an unprecedented scale. Online social networks play
a major role in the diffusion of this information and have
proven to be very powerful in many situations, such as the
2010 Arab Spring [10] or the 2008 U.S. presidential elections
[12]. They permit people to spread ideas and to organize groups
and actions in a new way, and we can consequently consider
that they add a whole new layer to human social life. Given
the impact of online social networks on society, understanding
the mechanics and dynamics of these networks is a critical
research objective. Since communications occurring online are
recorded, very large amounts of data are available to
researchers, who can exploit them to develop predictive models
for information diffusion in online social networks. This
proves to be a challenging task, due to (i) the particular laws
that govern these networks, (ii) the wide diversity in users'
profiles and, obviously, (iii) the large scale of these
structures.

1 This paper is an updated and extended version of [8].

"Information diffusion" is a generic concept that refers to 
all processes of propagation in a system, regardless of the 
nature of the object in motion. The diffusion of innovation 
over a network is one of the original reasons for studying 
networks and the spread of disease among a population has 
been studied for centuries. The models developed in the context
of social networks assume that people are influenced by
actions taken by their neighbours in the network; in other
words, they model processes of "information cascades" [3].
That is why the path followed by a piece of information in the
network is often referred to as the "spreading cascade". In the
case of online social networks, one can be interested in the
spread of particular objects like hashtags on Twitter, URLs,
or even broader concepts like topics.

PROBLEM DEFINITION: Given a set of users in a
social network (with explicit or inferred connections),
communicating through a messaging system in a closed
environment, and a piece of information, how can we predict the
degree of adoption of this information in the social network
over a given period of time, i.e. the temporal patterns
of the dynamics?

A closed environment here means that only internal constraints
are considered. For example, we don't take into account
the possibility that information may come from external
sources like news sites. Modeling information diffusion
first requires defining the set of actors that can potentially
be involved. In the context of online social networks, an actor
is referred to as a node. The simplest way to describe the
spreading process is to consider that a node is either
activated or not (i.e. informed or not); the propagation
process can then be viewed as a sequence of node activations.
A diffusion process occurring inside a network is
characterized by two aspects: (i) its topology and (ii) its
temporal dynamics.

Understanding, capturing, and being able to predict such a
phenomenon can be helpful in several areas such as marketing,
security, and Web search. These use cases all fall under
one of two well-defined problems: (i) influence
maximization [14], e.g. maximizing the spread of information,
and (ii) influence minimization, e.g. minimizing the spread of
misinformation [1] [15]. Most existing predictive models
focus on the topology of the process and are based on
unidimensional feature spaces. They intend to predict
properties like the depth of the spreading cascade or the total
size of the reached population and largely ignore the temporal
dimension. In this paper, we consider the issue of
predicting the temporal dynamics of the diffusion process,
more specifically the spread of topics, in online social
networks. Our initial assumption is that the macroscopic
dynamics (i.e. observed over the whole network) of the spreading
process are explained by the topology of the network and
the interactions that occur through it, between pairs of users,
on the basis of properties at the microscopic level (localized
in the network). The contributions of this paper are the
following:

1. An analytical discussion about how to detect spreading 
topics and the features that may explain the diffusion 
process. This step has been performed using a dataset 
crawled from Twitter. It enabled us to understand the 
overall process of information diffusion in a real social 
network. 

2. A new model for information diffusion in online social
networks and a set of features used to estimate its
parameters. This new model, T-BaSIC, permits a deeper and
more realistic integration of time in the prediction process.
The features are based on users' behaviour and belong to
three dimensions (social, topic, and time).

3. An experimental evaluation that aims to assess the ef- 
ficiency of our approach and the validity of the under- 
lying assumption (i.e. the macroscopic diffusion pro- 
cess is explained by the sum of microscopic interactions 
that occur because of local properties). 

The rest of this paper is organized as follows. Section 2
reviews various categories of related work and discusses their
relation to ours. In Section 3 we present the data and the
analysis we performed. Then in Section 4 we describe the
proposed model in detail. In Section 5 a set of experiments
is described to evaluate the efficiency of our modeling and
the validity of the underlying assumption. We conclude and
outline some future work in Section 6.

2. RELATED WORK 

In this section we review two categories of related work: 
(i) general modeling of spreading processes in complex sys- 
tems, classified into graphical and non-graphical approaches, 
and (ii) recent predictive models of information diffusion in 
online social networks (OSNs). 



2.1 Diffusion in complex systems 

Graph based approaches. Classical graph based ap- 
proaches assume the existence of a graph structure and focus 
on the topology of the process. They follow either the
Independent Cascades (IC) [7] or the Linear Threshold (LT) [14] model.
They are based on a directed graph where each node can 
be activated (i.e. informed) or not, with a monotonicity as- 
sumption, meaning that activated nodes cannot deactivate. 
The IC model requires a diffusion probability to be associ- 
ated to each edge whereas LT requires an influence degree 
to be defined on each edge and an influence threshold for 
each node. For both models, the diffusion process proceeds 
iteratively in a synchronous way along a discrete time-axis, 
starting from a set of initially activated nodes. In the case of 
IC, for each iteration, the newly activated nodes try once to 
activate their neighbours with the probability defined on the 
edge joining them. In the case of LT, at each iteration, the 
inactive nodes are activated by their activated neighbours 
if the sum of influence degrees exceeds their own influence 
threshold. Successful activations are effective at the next it- 
eration. In both cases, the process ends when no new trans- 
mission is possible, i.e. no neighbouring node can be con- 
tacted. These two mechanisms reflect two different points of 
view: IC is sender-centric while LT is receiver-centric. Both
models have the drawback of proceeding in a synchronous
way along a discrete time-axis, which doesn't match what is
observed in real social networks. To make these
models better suited to this particular context, Saito et al.
recently proposed asynchronous extensions of these models,
namely AsIC and AsLT [20], that use a continuous time-axis
and require a time-delay parameter on each edge of the
graph.
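The synchronous IC process described above can be illustrated with a minimal Python sketch. The function name `independent_cascade` and the dictionary-based graph representation are assumptions made for this illustration only; they are not taken from [7].

```python
import random

def independent_cascade(graph, prob, seeds, rng=None):
    """Simulate one run of the Independent Cascades (IC) model.

    graph: dict mapping each node to a list of its out-neighbours.
    prob:  dict mapping each directed edge (u, v) to its diffusion probability.
    seeds: iterable of initially activated nodes.
    Returns a list of sets, one per discrete time step, holding the nodes
    that became active at that step.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = set(seeds)
    steps = [set(seeds)]
    while frontier:
        newly_active = set()
        for u in frontier:
            for v in graph.get(u, []):
                # Each newly activated node tries once per inactive neighbour.
                if v not in active and v not in newly_active:
                    if rng.random() < prob[(u, v)]:
                        newly_active.add(v)
        # Successful activations become effective at the next iteration.
        active |= newly_active
        frontier = newly_active
        if newly_active:
            steps.append(newly_active)
    return steps
```

The loop stops when an iteration activates no new node, which matches the stopping condition given above. Monotonicity holds because nodes are only ever added to `active`.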

Non-graph based approaches. Classical non-graph
based approaches don't assume the existence of a graph
structure and have been mainly developed to model
epidemiological processes. They classify nodes into several classes
(i.e. states) and focus on the evolution of the proportions of
nodes in each class. The two most common models are SIR
and SIS [9] [18], where S stands for "susceptible", I for
"infected" (i.e. informed) and R for "recovered" (i.e. refractory).
In both cases, nodes in the S class switch to the I class with
a fixed probability $\beta$. Then, in the case of SIS, nodes in
the I class switch to the S class with a fixed probability
$\gamma$, whereas in the case of SIR they permanently switch to
the R class. The percentage of nodes in each class is given
by simple differential equations. Both models assume that
every node has the same probability of being connected to
another, and thus that connections inside the population are made
at random. But the topology of the relations between nodes is very
important in social networks, and thus the assumptions made
by these models are unrealistic.
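The differential equations are not spelled out in the text. The sketch below assumes the standard mean-field form, $ds/dt = -\beta s i$, $di/dt = \beta s i - \gamma i$ and $dr/dt = \gamma i$ for SIR, with the $\gamma i$ term returned to $s$ for SIS, and integrates it with a simple Euler scheme. The function name, the step size `dt` and the use of $\gamma$ as the rate of leaving the I class in SIR are assumptions made for illustration.

```python
def simulate_compartments(beta, gamma, i0, steps, dt=0.1, model="SIR"):
    """Euler integration of the mean-field SIR or SIS equations.

    beta:  rate at which susceptible nodes become infected (informed).
    gamma: rate at which infected nodes leave the I class.
    i0:    initial proportion of infected nodes (0 <= i0 <= 1).
    Returns a list of (s, i, r) proportions, one tuple per time step.
    """
    if model not in ("SIR", "SIS"):
        raise ValueError("model must be 'SIR' or 'SIS'")
    s, i, r = 1.0 - i0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt
        leaving = gamma * i * dt
        s -= new_infections
        i += new_infections - leaving
        if model == "SIR":
            r += leaving   # recovered nodes never return
        else:
            s += leaving   # SIS: nodes become susceptible again
        history.append((s, i, r))
    return history
```

No graph appears in the function signature: every node is assumed to mix uniformly with every other, which is precisely the assumption criticized above.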

2.2 Information diffusion in OSNs 

Various studies in the context of social networks have been
conducted with the aim of predicting properties of the
information spreading process. Most of them focus on
topological properties. For instance, Bakshy et al. [1] proposed a
graphical approach that aims to predict the size of the cascade
generated by the diffusion of a URL in the Twitter follower
graph, starting with a single initial user. This sender-centric
model relies on a regression tree, some simple
social attributes, and the past influence of the initial user
only. The influence of the initial user is approximated by
counting the number of cascades (implicit cascades inferred
from the follower graph) in which he was involved in the
past. Galuba et al. [6] also studied the diffusion of URLs
in Twitter, but from a receiver-centric point of view, and
proposed to use the LT model to predict which users will
adopt which URL. Yang and Counts [21] used survival analysis
to examine the impact of attributes from both users
and content to predict the size of the cascades generated 
by the spread of topics in Twitter. In order to do so, they 
exclusively focused on targeted tweets so they can directly 
identify the explicit cascade of diffusion. They found that 
both user and content attributes were relevant predictors of 
the diffusion efficiency. 

To the best of our knowledge, the Linear Influence Model 
developed by Yang and Leskovec [22] is the only real predic- 
tive model for the temporal dynamics that has already been 
proposed. They studied the diffusion of hashtags in Twit- 
ter and proposed a model based on the assumption that the 
influence of a node depends on how many other nodes it 
influenced in the past. However, there is a substantial
difference with our work because this approach is non-graph
based and doesn't study nodes' attributes. Therefore, this
approach doesn't take advantage of any knowledge about
the topology of the network. Moreover, in their modeling,
a node corresponds to the aggregation of 100 Twitter users,
which prevents studying the diffusion at a "user to
user" level.

Given this state of the art, we propose a generic graph-based
method to model information diffusion, so that the topology of
the network is exploited. We also detail how to apply
it to Twitter to predict the temporal dynamics of the spread
of topics among users. This is a particularly interesting
contribution of our work since existing approaches have
mainly focused on predicting the depth of the cascades
and/or the final volume of the propagation.

3. DIFFUSION OBSERVATION AND REPRESENTATION

In this section, we discuss some observations we made in
order to understand the diffusion process in social networks
and extract some underlying facts to represent
this phenomenon. Mainly for availability reasons, and like
the majority of studies on information diffusion
modeling in social networks that have used non-synthetic
data, we build the observation part of this paper on data
coming from Twitter. This allows us to easily position and
compare our approach with related work. Twitter is a
micro-blogging service that allows its users to publish public,
directed or undirected, short messages (140 characters at most)
and to follow other users that interest them. Sending a
directed message is achieved by mentioning the targeted users
directly in the content with the convention "@username".
Both directed and undirected messages are automatically
forwarded to the followers, but directed messages aim more
particularly at one or more specific users. Overall, Twitter
forms an online social network where the information flows
from place to place in two different ways: (i) passively,
via the following ties, and (ii) actively,
because users directly send information to others via
the mentioning practice. Mathematically, this network is
represented as a directed multi-graph comprised of two sets
of edges: (i) the set of following edges, which constitutes
the passive part of the network, commonly called the "follower
graph"; (ii) the set of mentioning edges, which represents the
active part of the social network.

DEFINITION 1 (Active/Passive Directed Edge). 
An active edge results from an explicit communication (i.e.
message passing) between two nodes in the network. It reflects
the existence of an active transmission of information
between two nodes. A passive edge simply means that a node
is exposed to the content produced by another.

We base our study on a dataset of 467 million Twitter posts
from 20 million users covering a 7-month period from June
1, 2009 to December 31, 2009, gathered by Yang and Leskovec
[23]. Each tweet contains its author, its content, and the
time at which it was posted. In addition, we
know the sets of followers and followees of each user thanks
to the capture of the Twitter follower graph (1.47 billion
directed edges, i.e. passive edges) made by Kwak et al. [16] at
the same time. Finally, we complete the network by extracting
the active edges based on the mentions contained in the
tweets. This data meets the common criteria for dataset
validation in social network analysis: (i) large scale, (ii)
completeness, and (iii) realism. A key task in the diffusion model
we are proposing is the identification and detection of topics,
a process explained in what follows.
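As an illustration of how active edges might be extracted from the "@username" convention, consider the following minimal sketch. The tuple layout `(author, content, timestamp)`, the regular expression and the function name are assumptions; the paper does not describe its extraction code.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "@" followed by letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")

def extract_active_edges(tweets):
    """Build the active (mentioning) part of the multi-graph.

    tweets: iterable of (author, content, timestamp) tuples.
    Returns a dict mapping each directed pair (author, mentioned_user)
    to the list of timestamps at which the mention occurred, so that
    repeated mentions are kept as parallel edges.
    """
    edges = defaultdict(list)
    for author, content, timestamp in tweets:
        for target in MENTION.findall(content):
            if target != author:  # ignore self-mentions
                edges[(author, target)].append(timestamp)
    return dict(edges)
```

Keeping one timestamp per mention preserves the multi-graph nature of the active part of the network. A production extractor would also need to handle false positives such as e-mail addresses, which this pattern does not.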

3.1 Topic extraction and diffusion observation

We intend to predict the spread of information among
Twitter users. In contrast to studies that have investigated
the diffusion of simple objects, such as URLs or hashtags
[22] [1] [6], we focus on topics as the main object to follow.
This allows us to have a global view of all the interactions
regarding a specific piece of information. This also avoids
several problems, such as the side effects that potentially exist
between distinct URLs that point to similar resources.
On the other hand, it is not so easy to detect spreading
topics. For the purpose of our study, we define a topic as
follows:

DEFINITION 2 (Topic). A topic is a minimal set of 
co-occurring terms (i.e. keywords) that a related tweet should 
contain and which spans over a given period of time. 

We are interested in recurrent terms that experience a
peak in their usage during a significant period of time. It
means that we are not interested in non-recurrent terms
that are not observed before and after the period during
which they are popular. To find topics that fit this definition,
we use the method described hereafter to find relevant
terms and then manually investigate further to precisely define
interesting sets of terms. We select all the tweets published
during the period of time we want to study and perform
a discretization. To do so, the data is transformed
into an ordered collection of documents, where each document
is the aggregation of 4 hours of tweets (see footnote 1). Then we
compute the vector of the number of occurrences of a term
in each document, denoted $O_{term}$. We define the
interestingness of a term as the score computed by Equation 1. Note
that the highest values are obtained by terms that maximize
the ratio $\max(O_{term})/\mathrm{avg}(O_{term})$ and minimize the ratio
$\min(O_{term})/\mathrm{avg}(O_{term})$. Finally, we rank the terms
according to their score in order to identify which topics to focus
on.

$$\mathrm{score}(term) = \frac{\mathrm{avg}(O_{term}) + \min(O_{term}) \times \max(O_{term})}{\min(O_{term}) \times \mathrm{avg}(O_{term})} \tag{1}$$

1 This is done using the Lucene library
(http://lucene.apache.org/core/) to index their content.

term          score
Christmas     24.56
snow          22.88
iphone        15.19
google        15.05
...           ...
twitpic.com   6.64
twitter       6.34
bit.ly        5.14
lol           5.12

Table 1: Highest and lowest ranked terms according
to the interestingness score during the month of
December 2009.
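The discretization into 4-hour documents and the two ratios discussed in the text can be sketched as follows. This is only an illustration: the paper indexes tweets with Lucene, whereas the sketch uses naive whitespace tokenization, and the function names and the `(timestamp, text)` layout are assumptions. It deliberately stops at the two ratios and does not reproduce the full score.

```python
from collections import Counter

BIN_SECONDS = 4 * 3600  # each document aggregates 4 hours of tweets

def occurrence_vector(tweets, term, start, end):
    """Count occurrences of `term` per 4-hour document in [start, end).

    tweets: iterable of (timestamp_in_seconds, text) pairs, integer timestamps.
    Returns a list with one count per 4-hour document.
    """
    n_docs = -(-(end - start) // BIN_SECONDS)  # ceiling division
    counts = [0] * n_docs
    for timestamp, text in tweets:
        if start <= timestamp < end:
            tokens = Counter(text.lower().split())
            counts[(timestamp - start) // BIN_SECONDS] += tokens[term]
    return counts

def peak_and_floor_ratios(counts):
    """Return (max/avg, min/avg) of an occurrence vector, or None if empty."""
    if not counts or sum(counts) == 0:
        return None
    avg = sum(counts) / len(counts)
    return max(counts) / avg, min(counts) / avg
```

A term with a pronounced peak yields a large first ratio, while a term with a steady volume yields two ratios that are both close to 1.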

See Table 1 for the 4 best and worst ranked terms in December 2009.
Unsurprisingly, "christmas" is the top ranked term, because
it is a sustained discussion topic throughout the month and
suddenly becomes an extremely popular term right before
and after Christmas on December 25th. This is, however,
a particular case where the peak of activity is linked to
an annual event and doesn't result from the spread of
interesting information between Twitter users. Let's now
consider the example of the term "iphone". We observe a
peak of usage starting around December 8th in Figure 1(b).
By searching through specialized websites, we find that a
rumour about the "release" date of the new version of the
smartphone surfaced on this day and then spread through
social networks. This is confirmed by a sharp increase in
the frequency of co-occurrence of the two terms {"iphone",
"release"} in tweets published from December 8th to 15th.
Therefore, the set {"iphone", "release"} defines an interesting
spreading topic. The same holds for the term "google", which
experiences a peak of activity in December because of the
spread of a rumour about the buyout of a company whose
technology could contribute to Google Wave (thus we define
the topic {"google", "buy"}). On the contrary, "twitpic.com"
has a relatively steady volume (because it appears each time
a user posts a picture with her tweet) and is therefore poorly
ranked.

Through this analysis, at least two dimensions that are 
needed to capture the diffusion process have emerged: (i) 
the topical dimension since we observed that the various 
topics had different behaviours in terms of volumes for in- 
stance, meaning that users are not interested in all topics 
but generally in a subset of those topics; (ii) the temporal
dimension, because we observed a cyclic pattern common
to all topics that is due to, e.g., the switch from day to
night, working hours, and the total time for the spread of
the information.

3.2 Representation of the propagation process 

We exploit the data related to selected topics to observe
aspects of the diffusion process in Twitter. Thus, we build
the structure that transcribes "who influenced whom" for
each topic and we also capture the temporal dynamics of the
diffusion. This process is built on two concepts: activation
sequence and spreading cascade.

Figure 1: Evolution of the volume of particular
terms during December 2009, namely, from top to
bottom, "christmas", "iphone", "google" and "twitpic.com".

DEFINITION 3 (Activation Sequence). An activa- 
tion sequence is an ordered set of nodes capturing the order 
in which the nodes of the network adopted the topic, i.e. got 
informed. 

DEFINITION 4 (Spreading Cascade). A spreading 
cascade is a tree having as a root the first node of the activa- 
tion sequence. The tree captures the influence between nodes 
(i.e. who transmitted the information to whom) and unfolds 
in the same order as the activation sequence. 

Given a topic and its minimal set of keywords, we can
easily detect all the related messages and generate the time
series of the volume of tweets induced by the diffusion of the
topic (i.e. its dynamics). We also determine the sequence
of node activations in the network. Then we aim to solve
the problem of reconstructing the graph of diffusion (i.e.
the spreading cascade) by connecting the activated nodes
to each other. We base the construction of this graph on
the topology of the passive part of the network. In other
words, we model the spreading process over the following
links. It means that for each activated node, we want to infer
which previously activated node among its followees
influenced her. As discussed in [2], in the case where
several followees are activated, there are basically three ways
to assign influence: (i) assign it to the followee that adopted
the topic first, (ii) assign it to the last followee to react, or
(iii) assign it to all the followees. In this study we assume
that individuals are influenced by the followee that adopted
the topic most recently (i.e. the second method, referred
to as "Last Influence"). Let us illustrate how we build the
spreading cascade of a topic with the following example.
Let's say we have a social network of 6 users, where $v_2$, $v_3$,
$v_4$ and $v_5$ follow $v_1$; $v_5$ and $v_6$ follow $v_4$. Nodes $v_1$, $v_4$ and
$v_5$ are activated in this order. Therefore, based on the "Last
Influence" principle, we say that instances of diffusion have
occurred between $v_1$ and $v_4$, and $v_4$ and $v_5$, whereas there
are instances of non-diffusion between $v_1$ and $v_2$, $v_4$ and $v_6$,
etc. Finally we can build the spreading cascade shown in
Figure 2, where each edge is directed and labeled with
either "diffusion" or "non-diffusion".

Figure 2: Example of a spreading cascade. Nodes
coloured in light grey represent users that have
tweeted about the topic of interest at time $t$, with
$t_1 < t_2 < t_3$. Users represented by nodes coloured in
dark grey didn't.
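The "Last Influence" labeling can be sketched in Python on the 6-user example. The function name and data layout are hypothetical. The sketch also assumes that every follow link leaving an activated node that is not credited with a diffusion counts as a non-diffusion instance, which is one reading of the "etc." in the example.

```python
def last_influence_cascade(followees, activation_sequence):
    """Label follow links as diffusion or non-diffusion.

    followees: dict mapping each user to the set of users they follow.
    activation_sequence: list of users in the order they adopted the topic.
    Returns (diffusion, non_diffusion), two sets of directed
    (source, target) pairs, information flowing from source to target.
    """
    rank = {user: k for k, user in enumerate(activation_sequence)}
    diffusion = set()
    for user in activation_sequence:
        earlier = [f for f in followees.get(user, ())
                   if f in rank and rank[f] < rank[user]]
        if earlier:
            # "Last Influence": credit the followee that adopted most recently.
            diffusion.add((max(earlier, key=rank.get), user))
    non_diffusion = set()
    for follower, followed in followees.items():
        for source in followed:
            # Assumption: any uncredited link out of an activated node
            # is an instance of non-diffusion.
            if source in rank and (source, follower) not in diffusion:
                non_diffusion.add((source, follower))
    return diffusion, non_diffusion


followees = {
    "v2": {"v1"}, "v3": {"v1"}, "v4": {"v1"},
    "v5": {"v1", "v4"}, "v6": {"v4"},
}
diffusion, non_diffusion = last_influence_cascade(followees, ["v1", "v4", "v5"])
```

On this example the diffusion set contains the pairs (v1, v4) and (v4, v5), as in the text. Each labeled pair, together with the time of the activation, gives one instance of the binary class {diffusion, non-diffusion}.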

With the methods we described, we are able to (i) detect 
interesting spreading topics and (ii) capture their diffusion 
process. Thus we can construct datasets for various topics, 
consisting of instances belonging to the binary class {diffu- 
sion, non-diffusion} and described by a pair of users and a 
timestamp. Moreover, a third dimension is explicitly high- 
lighted thanks to this representation: the social dimension 
since the information flows due to influence between mem- 
bers of the social network. As a result, the three dimensions 
(social, topic, time) are the foundations of the model we are 
proposing and which we describe in the next section. 

4. PROPOSED METHOD 

In this section we introduce the method we propose to predict
the diffusion phenomenon observed in online social networks.
To start, we formally define the Time-Based Asynchronous
Independent Cascades (T-BaSIC) model underlying
our approach. Then, we present the list of features computed
for each member of the network and finally we describe
how they are used to estimate the model parameters.
Table 2 summarizes the notations used in this section.

4.1 Time-Based Asynchronous Independent Cascades

We begin by recalling the definition of AsIC (i.e. the
Asynchronous Independent Cascades model) according to [20],
which is an extension of the IC model in which the diffusion
unfolds in continuous time. It models the diffusion of
information through a directed network $G = (V, E)$, where $V$ is
the set of all the nodes and $E \subset V \times V$ is the set of all
the links. For each link $(v_x, v_y)$, two real values are fixed
in advance: $p_{v_x,v_y}$, with $0 < p_{v_x,v_y} < 1$, and $r_{v_x,v_y}$, with
$r_{v_x,v_y} > 0$. $p_{v_x,v_y}$ is referred to as the diffusion probability
and $r_{v_x,v_y}$ is referred to as the time-delay parameter. The
diffusion process unfolds in continuous time and, as for IC,
starts from a given set of initially activated nodes $S$. Each
node $v_x$ that becomes activated at time $t$ is given a single
chance to activate each of its inactive neighbours $v_y$ with
probability $p_{v_x,v_y}$ at time $t + r_{v_x,v_y}$. The stopping
condition of the process is the same as for IC, i.e. when no more
activations are possible.

To better capture the dynamics underlying
the diffusion process in social networks, we propose another
way of considering the parameters of diffusion models,
which we incorporate into the T-BaSIC model. In the
T-BaSIC model, a real value $r_{v_x,v_y}$ is fixed in advance and
a real function $f_{v_x,v_y}(t)$ is defined for each link $(v_x, v_y)$, with
$0 < f_{v_x,v_y}(t) < 1$. Unlike in other models, the diffusion
probability is not fixed in advance, but is a time-dependent
function $f_{v_x,v_y}(t)$ referred to as the diffusion function. Thus, the
propagation process unfolds in the same way as AsIC, but
the algorithm simulates the course of days by keeping a clock
and $p_{v_x,v_y}$ is computed on demand, according to $f_{v_x,v_y}(t)$.
We use the model to produce the time series that represents
the evolution of the volume of tweets generated by the diffusion
of a topic introduced in a given social network by a
certain subset $S$ of users. Figure 3 illustrates this principle
and shows the input and output of T-BaSIC.

Figure 3: The T-BaSIC Model predicts the cascade
of diffusion along a continuous time-axis based on
the time-delay and diffusion function on each edge,
and the initial active node set S.
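A minimal event-driven sketch of such a simulation is given below. It is not the authors' implementation: the function name, the `horizon` parameter and the choice to evaluate the diffusion function at the time of the activation attempt (rather than at the sender's activation time) are assumptions made for illustration.

```python
import heapq
import random

def simulate_tbasic(graph, delay, diffusion_fn, seeds, horizon, rng=None):
    """Sketch of a T-BaSIC-style simulation in continuous time.

    graph:        dict node -> list of out-neighbours (who can be informed).
    delay:        dict (u, v) -> time-delay r > 0, fixed in advance.
    diffusion_fn: callable (u, v, t) -> probability, evaluated on demand.
    seeds:        initially activated nodes S, activated at t = 0.
    horizon:      attempts scheduled after this time are discarded.
    Returns a dict node -> activation time.
    """
    rng = rng or random.Random()
    activated = {}
    # Events are (time, tie_breaker, source, target); source None marks a seed.
    events = [(0.0, k, None, s) for k, s in enumerate(seeds)]
    heapq.heapify(events)
    counter = len(events)
    while events:
        t, _, source, target = heapq.heappop(events)
        if target in activated:
            continue  # activated nodes stay activated
        if source is not None and rng.random() >= diffusion_fn(source, target, t):
            continue  # the single activation attempt on this link failed
        activated[target] = t
        for neighbour in graph.get(target, []):
            attempt_time = t + delay[(target, neighbour)]
            if neighbour not in activated and attempt_time <= horizon:
                heapq.heappush(events, (attempt_time, counter, target, neighbour))
                counter += 1
    return activated
```

The priority queue plays the role of the clock: events are processed in chronological order, and the probability for a link is only computed when its attempt is popped. Binning the returned activation times (for example by hour) would give a volume time series of the kind the model is meant to produce.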



4.2 Feature space

Our model computes a diffusion probability relying on
three dimensions: social, semantic, and time. We denote by
$p_{v_x,v_y}(i,t) = f_{v_x,v_y}(t)$ the diffusion probability of a piece of
information $i$ associated with a topic $c_i$ at time $t$ between users
$v_x \in V$ (sender) and $v_y \in V$ (receiver). The attributes
we derive from these dimensions are either numerical values
varying between 0 and 1 or boolean values. Their calculation
is based on the past activity of the user(s) over a given
time period. Here we give the formulations of the metrics for a
period of one month.

Notation                          Description
$V$                               the set of all vertices (i.e. users) in the social network
$v_x \in V$                       a particular node in the social network
$S$                               a subset of vertices, $S \subset V$
$E$                               the set of all edges in the social network
$M$                               all the messages (i.e. tweets) of the environment
$M_v$                             the set of messages published by a user $v \in V$
$\overrightarrow{M}_v$            the set of users who are mentioned in the messages of a user $v \in V$
$\overleftarrow{M}_v$             the set of users who mentioned the user $v \in V$ in their messages
$tM_v$                            all the messages which have mentioned a user $v \in V$
$K$                               the set of all keywords used in the messages published inside the network
$k_i \in K$                       a specific keyword contained in the messages published inside the network
$K_v \subset K$                   the set of keywords included in the messages published by a user $v \in V$
$C = \{c_1, c_2, \ldots, c_q\}$   the set of all the topics; $c_i$ is a particular topic
$c_i = \{k_1, \ldots, k_p\}$      the vector of keywords $k_j \in K$ describing a topic $c_i$

Table 2: Notations used in this paper.

Social dimension: This dimension intends to quantify
the social interactions occurring between users. It is based
on metrics that mainly rely on topological properties of the
active part of the Twitter social network. This choice is
motivated by the predictive power of these links in the
diffusion process, as Yang and Counts stated in [21]. These
five metrics concern either a single user or a pair of users and
are described below.

• Activity (I): an activity index expresses the volume
of tweets a user produces. The activity is computed as
the average number of tweets emitted per hour, bounded
by 1. For a user $v$, the formula is as follows:

$$I(v) = \begin{cases} \dfrac{|M_v|}{\epsilon} & \text{if } |M_v| < \epsilon \\ 1 & \text{otherwise} \end{cases} \tag{2}$$

with $\epsilon = 30.4 \times 24$ to obtain the hourly frequency.

• Social homogeneity (H): a social homogeneity index
for $v_x \in V$ and $v_y \in V$ reflects the overlap of the sets
of users they interact with. It is computed with the
Jaccard similarity index, defined as the size of
the intersection of the sets divided by the size of their
union, where $\overrightarrow{M}_v$ is the set of users mentioned in the
messages of $v$.

$$H(v_x, v_y) = \frac{|\overrightarrow{M}_{v_x} \cap \overrightarrow{M}_{v_y}|}{|\overrightarrow{M}_{v_x} \cup \overrightarrow{M}_{v_y}|} \tag{3}$$

• The ratio of directed tweets for each user (dTR) provides
an idea about the role she plays in the spread
of information. A user with a high ratio of directed
tweets tends to play an active role, whereas a
user with a low ratio can be seen as a more passive actor.
It should be noted that our definition of directed
tweets includes retweets. This ratio is computed as

$$\mathrm{dTR}(v) = \begin{cases} \dfrac{|\{m \in M_v : m \text{ is directed}\}|}{|M_v|} & \text{if } |M_v| > 0 \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

• A boolean value for each user regarding the mentioning
behaviour (hM) captures the existence of an active
interaction in the past. This feature can be
regarded as a "friendship" indicator in the case where
both users have a positive value. This constitutes a
different definition of friendship from the one given by
Huberman et al. [11], where a user is friends with the users
she mentioned at least twice.

$$\mathrm{hM}(v_x, v_y) = \begin{cases} 1 & \text{if } v_y \in \overrightarrow{M}_{v_x} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

• The mention rate (mR) [21] of each user represents
the volume of directed tweets she receives. Thus, the
higher the value is, the higher the degree centrality of
the node on the active part of the network is. All in all,
this feature expresses the popularity of the user and
the amount of information she is exposed to.

$$\mathrm{mR}(v) = \begin{cases} \dfrac{|tM_v|}{\mu} & \text{if } |tM_v| < \mu \\ 1 & \text{otherwise} \end{cases} \tag{6}$$

Based on our empirical observation of the distribution
of the mention rates, we have chosen $\mu = 200$.
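The five social metrics of Equations 2 to 6 can be sketched as small Python functions. The function names and the choice of passing pre-computed counts and sets are assumptions made for illustration. Returning 0 for the social homogeneity of two users who mentioned nobody is also an assumption, since the text leaves that case undefined.

```python
EPSILON = 30.4 * 24  # hours in an average month, as in Equation 2
MU = 200             # mention-rate cap chosen in the text

def activity(n_messages):
    """Equation 2: average number of tweets per hour, bounded by 1."""
    return min(n_messages / EPSILON, 1.0)

def social_homogeneity(mentioned_by_x, mentioned_by_y):
    """Equation 3: Jaccard index of the two sets of mentioned users."""
    union = mentioned_by_x | mentioned_by_y
    if not union:
        return 0.0  # assumption: no interactions means no overlap
    return len(mentioned_by_x & mentioned_by_y) / len(union)

def directed_tweet_ratio(n_directed, n_messages):
    """Equation 4: share of a user's tweets that are directed."""
    return n_directed / n_messages if n_messages > 0 else 0.0

def has_mentioned(mentioned_by_x, y):
    """Equation 5: 1 if user x mentioned user y in the past, else 0."""
    return 1 if y in mentioned_by_x else 0

def mention_rate(n_received_mentions):
    """Equation 6: volume of directed tweets received, bounded by 1."""
    return min(n_received_mentions / MU, 1.0)
```

For example, a user who posted 12 tweets in the month, 3 of which were directed, has `directed_tweet_ratio(3, 12) == 0.25`. The two bounded metrics use `min(..., 1.0)`, which is equivalent to the piecewise definitions.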

Semantic/Topical dimension: In addition to the social
features that exploit the structure of the network, we
consider the exchanged content to refine our perception of
users' behaviour. The proposed metric applies to a user and
a topic and states in a binary way whether the user has already
tweeted about the given topic.

$$\mathrm{hK}(v, i) = \begin{cases} 1 & \text{if } c_i \in K_v \\ 0 & \text{otherwise} \end{cases} \tag{7}$$



Temporal dimension: Finally, we consider the temporal
dimension so that we can capture the fluctuation of
users' attention through time. The varying attention of
individuals is an important characteristic of online social
networks and is strongly connected to the day/night cycle
and working hours. To represent how the attention of a user
evolves during a day, we define a receptivity function A(v, t).
We model it in a non-parametric way and thus partition a
day into 6 bins of 4 hours each, in order to obtain a
significant and smooth representation even for less active users.
We define the receptivity level of a user at the time of the
day t as the proportion of all the tweets she produced in the
4-hour interval $[t_x; t_y]$, where $t_x < t < t_y$. The function is
stored in a 6-dimensional non-negative vector denoted V, with


$$A(v, t) = V_{t'} \quad \text{where } t' = \left\lfloor \frac{t - 1}{4} \right\rfloor \qquad (8)$$



For instance, if a user posted 60 messages during the
month, 50 of which between 4 pm and 8 pm and the rest
between 8 am and noon, her receptivity function would be
$V_u = \{0, 0, 0.167, 0, 0.833, 0\}$.
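The worked example above can be reproduced with a short sketch. It assumes
that each tweet is reduced to its posting hour as an integer from 0 to 23,
so that the bin index is simply `hour // 4`; this encoding, like the
function names, is an assumption of the example and may differ from the
exact indexing used in Equation (8).

```python
from collections import Counter

BIN_HOURS = 4
N_BINS = 24 // BIN_HOURS  # 6 bins of 4 hours each


def receptivity_vector(tweet_hours):
    """Proportion of a user's tweets that fall in each 4-hour bin."""
    total = len(tweet_hours)
    if total == 0:
        # Assumed convention: an inactive user has zero receptivity everywhere.
        return [0.0] * N_BINS
    counts = Counter(hour // BIN_HOURS for hour in tweet_hours)
    return [counts.get(b, 0) / total for b in range(N_BINS)]


def receptivity(vector, hour):
    """Receptivity level at a given hour of the day (0 to 23)."""
    return vector[hour // BIN_HOURS]


# 50 tweets posted at 5 pm and 10 tweets posted at 9 am, as in the example.
hours = [17] * 50 + [9] * 10
vector = receptivity_vector(hours)
rounded = [round(x, 3) for x in vector]
```

Here `rounded` equals `[0.0, 0.0, 0.167, 0.0, 0.833, 0.0]`, which matches
the vector given in the text.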

Global feature space: Overall, the metrics we just
detailed form a 3-dimensional feature space, with 13 values
describing each tuple ($v_x \in V$, $v_y \neq v_x \in V$, $c_i \in C$, $t$).
Figure 4 illustrates a possible instantiation of that vector.

Figure 4: A pair of users $(v_x, v_y)$ is described by 13
features w.r.t. a topic $c_i$ and a time of the day t.
The figure shows how the 3 dimensions of the
feature space are connected.



Once the feature space is constructed, the parameters of
the model can be learned and estimated. This is performed
through machine learning techniques and is detailed in the
following section.
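One possible way to hold the 13 values describing a pair of users is
sketched below. The field names and their order are hypothetical; the
split into nine numerical features and four boolean ones (a mention
indicator and a topic indicator for each of the two users) is our reading
of the feature definitions above, not a layout given by the text.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class PairFeatures:
    """Features of a (source, destination) pair for a topic and a time of day."""
    activity_src: float        # I(src)
    activity_dst: float        # I(dst)
    homogeneity: float         # H(src, dst)
    dtr_src: float             # dTR(src)
    dtr_dst: float             # dTR(dst)
    mentioned_by_src: int      # hM(src, dst)
    mentioned_by_dst: int      # hM(dst, src)
    mention_rate_src: float    # mR(src)
    mention_rate_dst: float    # mR(dst)
    topic_src: int             # hK(src, c_i)
    topic_dst: int             # hK(dst, c_i)
    receptivity_src: float     # A(src, t)
    receptivity_dst: float     # A(dst, t)

    def as_vector(self):
        """Flatten the features into a list usable by a classifier."""
        return list(astuple(self))


# Toy values, made up for illustration only.
pair = PairFeatures(0.1, 0.2, 0.0, 0.5, 0.4, 1, 0, 0.2, 0.3, 1, 0, 0.3, 0.25)
```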

4.3 Diffusion function parameters estimation 

First, we build a sample of data comprising 4 experimental
social networks (sub-graphs). To build each of them, we
choose a Twitter user at random among the millions contained
in the data and then select all users at most 2 hops away
from her according to the "following" links. Each social
network presents particular characteristics in terms of level
of activity, density of the passive and active parts, and
global size; see Table 3 for details. In each network, we
capture the diffusion of several topics during December 2009
and build the spreading cascades with the method described
earlier in Section 3. We describe each instance of "diffusion"
and "non-diffusion" by the 13 features related to the two
users concerned, computed from their activity during November
2009. Table 4 provides the mean and standard deviation
of the numerical features of that learning dataset (a balanced
binary dataset of 20,000 instances).
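The selection of a 2-hop sub-graph around a seed user can be sketched as a
breadth-first search. The adjacency dictionary `following` and the choice
to traverse links from a user to the accounts she follows are assumptions
of this example; the text does not say in which direction the following
links are traversed.

```python
from collections import deque


def two_hop_network(following, seed, max_hops=2):
    """Users reachable from `seed` in at most `max_hops` following links."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in following.get(user, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[user] + 1
                queue.append(neighbour)
    return set(distance)


# Toy chain a -> b -> c -> d: only a, b and c are within 2 hops of a.
toy_following = {"a": ["b"], "b": ["c"], "c": ["d"]}
members = two_hop_network(toy_following, "a")
```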

We train several classification algorithms on the supervised
task P(Y|F), with Y = {diffusion, non-diffusion} and F
the 13-dimensional feature vector. Results obtained by a
C4.5 decision tree, linear and multilayer (1 hidden layer
of 14 nodes) Perceptrons and the Bayesian logistic regression
(BLR) are shown in Table 5. All classifiers perform
comparably (85% to 86% of correctly classified instances),
apart from C4.5, which reaches 91%. Because the decision tree
is more vulnerable to overfitting, and linear and multilayer
Perceptrons give similar results, we use the Bayesian logistic
regression to define the diffusion function.

| Feature    | Mean  | Standard deviation |
|------------|-------|--------------------|
| I(src)     | 0.148 | 0.185              |
| I(dst)     | 0.104 | 0.143              |
| mR(src)    | 0.163 | 0.258              |
| mR(dst)    | 0.22  | 0.324              |
| dTR(src)   | 0.488 | 0.242              |
| dTR(dst)   | 0.47  | 0.276              |
| A(src,t)   | 0.306 | 0.178              |
| A(dst,t)   | 0.247 | 0.192              |
| H(src,dst) | 0.004 | 0.02               |

Table 4: Mean and standard deviation of the numerical
features of the learning dataset.



| Classifier                   | Correctly classified instances |
|------------------------------|--------------------------------|
| C4.5                         | 91%                            |
| Linear Perceptron            | 85%                            |
| Multilayer Perceptron        | 86%                            |
| Bayesian logistic regression | 85%                            |

Table 5: Classifier performance on a 5-fold cross-validation.

The BLR assumes a parametric form for the distribution
P(Y|F). The parametric model used by the logistic regression
is as follows (as defined in [17]):

$$P(Y = \text{diffusion} \mid F) = \frac{1}{1 + \exp\left(w_0 + \sum_{a=1}^{13} w_a F_a\right)}$$

$$P(Y = \text{non-diffusion} \mid F) = \frac{\exp\left(w_0 + \sum_{a=1}^{13} w_a F_a\right)}{1 + \exp\left(w_0 + \sum_{a=1}^{13} w_a F_a\right)}$$
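The parametric form above translates directly into code. The sketch below
is only an illustration: the function names are ours, and the weights used
in the usage example are made up rather than the coefficients learned in
the paper. Note that with this sign convention a larger weighted sum
lowers the probability of diffusion.

```python
import math


def diffusion_probability(weights, bias, features):
    """P(Y = diffusion | F) = 1 / (1 + exp(w_0 + sum_a w_a * F_a))."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(z))


def non_diffusion_probability(weights, bias, features):
    """P(Y = non-diffusion | F), the complement of the diffusion probability."""
    return 1.0 - diffusion_probability(weights, bias, features)


# With all weights set to zero, both outcomes are equally likely.
neutral = diffusion_probability([0.0] * 13, 0.0, [0.5] * 13)

# Made-up weights: a negative weighted sum pushes the probability above 0.5.
toy = diffusion_probability([-1.0, 0.5], 0.0, [2.0, 1.0])
```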



In more detail, the Bayesian logistic regression has a
precision rate of 79% based on the attributes belonging to
the social dimension alone and obtains a gain of 7% with the
addition of both temporal and semantic dimensions, leading to
a precision rate of 85% with the full feature space. Figure 5
illustrates the absolute normalized values of the weights (i.e.
$|w_a / \max(w_a)|$) that the logistic regression assigns to each
feature. The "social homogeneity" has the highest coefficient
because its mean of 0.004 is much lower than that of the
other features. One can also see that the most significant
properties of the receiving node are the level of activity
and the connection to the topic. Concerning the sending node,
the mention rate is the most relevant property.
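The normalization used for the weights can be written in a few lines. The
sketch follows the expression given in the text literally, dividing each
weight by the largest weight and taking the absolute value; the function
name and the toy weights are assumptions made for this example.

```python
def normalized_weights(weights):
    """Absolute value of each weight divided by the largest weight."""
    largest = max(weights)
    # Assumes the largest weight is non-zero, as in the toy example below.
    return [abs(w / largest) for w in weights]


# Made-up weights for three features.
scaled = normalized_weights([2.0, -1.0, 0.5])
```

With these toy values the result is `[1.0, 0.5, 0.25]`.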

4.4 Time-delay estimation 

For each instance of the "diffusion" class from the dataset,
we know at what time the two users adopted the topic (i.e.
tweeted about it), so we are able to compute the real
diffusion delay. As we have just seen, the activity index of
the receiving user is a critical parameter, and we base the
approximation of the time-delay on it, with the following
formula: $\tau_{v_x,v_y} = (1 - I(v_y)) \times \alpha$. Therefore, for each
instance of diffusion we can compare the real and estimated
diffusion delays. In order to determine the optimal value of
$\alpha$, we define two vectors: (i) the vector of real diffusion
delays and (ii) the vector of estimated diffusion delays. We
then define $E(\alpha)$ as the Euclidean distance between these two
vectors, which depends on the value of $\alpha$. We find that
$E(\alpha)$ is minimal for $\alpha = 10$, so the formula used to
estimate the diffusion delay becomes
$\tau_{v_x,v_y} = (1 - I(v_y)) \times 10$. This means that the maximum
diffusion delay in our model is 10 hours. This aligns
well with observations made in previous studies [6, 22], which
reveal that diffusion events occur across a time-frame of at
most 8 to 12 hours.

| Social network | # of users | # of tweets (November) | # of following edges | Active network density (November) | Passive network density |
|----------------|------------|------------------------|----------------------|-----------------------------------|-------------------------|
| 1              | 24,571     | 303,564                | 1,928,999            | 5.15 × 10^-6                      | 6.39 × 10^-3            |
| 2              | 44,410     | 469,775                | 4,398,953            | 4.23 × 10^-6                      | 5.13 × 10^-3            |
| 3              | 11,614     | 169,689                | 308,849              | 7.30 × 10^-6                      | 4.58 × 10^-3            |
| 4              | 29,625     | 226,753                | 2,507,768            | 2.79 × 10^-6                      | 5.71 × 10^-3            |

Table 3: Properties of the four experimental social networks.

Figure 5: Visualization of the normalized weights
computed by the Bayesian logistic regression.
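The choice of the delay coefficient can be illustrated with a small search
that minimizes the Euclidean distance between real and estimated delays.
This is a sketch under assumptions: the text does not say how the minimum
was searched for, so the grid of integer candidates from 1 to 24 hours is
hypothetical, and the toy data are made up.

```python
import math


def estimated_delay(activity_index, alpha):
    """Delay estimate (1 - I(v_y)) * alpha, in hours."""
    return (1.0 - activity_index) * alpha


def euclidean_error(real_delays, activity_indices, alpha):
    """Euclidean distance between real and estimated delay vectors."""
    return math.sqrt(sum(
        (real - estimated_delay(index, alpha)) ** 2
        for real, index in zip(real_delays, activity_indices)
    ))


def best_alpha(real_delays, activity_indices, candidates):
    """Candidate value that minimizes the Euclidean error."""
    return min(
        candidates,
        key=lambda alpha: euclidean_error(real_delays, activity_indices, alpha),
    )


# Toy data: delays consistent with a coefficient of 10 hours.
activity = [0.1, 0.5, 0.9]
real = [9.0, 5.0, 1.0]
alpha = best_alpha(real, activity, range(1, 25))
```

On this toy data the search returns 10. A fully inactive receiver
(activity index 0) then gets the maximum delay of 10 hours.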



4.5 Generalization of the approach 

Having established how to estimate T-BaSIC parameters,
we use it as a prediction engine. The required input is: (1)
a topic, as defined in Section 3; (2) a social network
described by (i) the set of users V and their 3-dimensional
description and (ii) the topology of their interconnection
based on the following links; and (3) a subset of users
$S \subset V$ that inject the topic into the network and thus
initiate the diffusion. Given this input, the algorithm
unfolds and manages a clock, which is used to reproduce the
course of the day and the variations of users' receptivity.
As output, the engine generates time-series representing the
volume of tweets induced by the spread of the topic inside
the network.

T-BaSIC is a generic model, but the estimation of its pa- 
rameters depends on the social platform one wants to adapt 
it to. The approach we have presented can be applied to 
any social network based on the explicit declaration of so- 
cial links and that permits its users to publish both directed 
and undirected messages. The coefficients of the diffusion 
function and the time-delay can be then adjusted to the 
data during the learning step. 
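To make the role of the clock more concrete, here is a simplified
event-driven sketch in the spirit of asynchronous independent cascade
models. It is not the authors' algorithm: the function `simulate`, its
callback parameters `probability` and `delay`, the single transmission
attempt per pair of users, and the counting of one tweet per adopter are
all assumptions made for illustration.

```python
import heapq
import random
from collections import Counter


def simulate(followers, seeds, probability, delay, horizon_hours, rng=None):
    """Return {day: number of users who adopt the topic on that day}."""
    rng = rng or random.Random(0)
    # Each event is (activation time in hours, user); seeds start at time 0.
    events = [(0.0, user) for user in seeds]
    heapq.heapify(events)
    adopted = set()
    volume = Counter()
    while events:
        clock, user = heapq.heappop(events)
        if clock > horizon_hours:
            break
        if user in adopted:
            continue
        adopted.add(user)
        volume[int(clock // 24)] += 1
        for target in followers.get(user, ()):
            if target in adopted:
                continue
            # The probability may depend on the time of day (clock % 24).
            if rng.random() < probability(user, target, clock % 24):
                heapq.heappush(events, (clock + delay(user, target), target))
    return dict(volume)


# Toy chain a -> b -> c with certain transmission and a 30-hour delay.
toy_followers = {"a": ["b"], "b": ["c"]}
series = simulate(
    toy_followers, ["a"],
    probability=lambda src, dst, hour: 1.0,
    delay=lambda src, dst: 30.0,
    horizon_hours=100,
)
```

In this toy run each user adopts the topic on a different day, so the
output is one adoption on each of days 0, 1 and 2. In a fuller version,
`probability` would be the learned diffusion function and `delay` the
time-delay estimate.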



5. EXPERIMENTS 

We evaluate the efficiency of our approach and modeling
on the task of predicting the temporal dynamics of the
spread of topics selected with the method described in
Section 3. We denote by $P_{c_i}(t)$ the predicted daily volume of
tweets for topic $c_i$ in the network and by $R_{c_i}(t)$ the real
daily volume of tweets (i.e. observed in the data). The
different networks used for the experiments are described in
Table 3. We choose the users constituting the starting set S
by selecting the first s users observed in $R_{c_i}(t)$. Hereafter
we present the results we obtained with the optimal value of
s for selected examples.

5.1 Qualitative results 

Figure 6 shows the comparison of real and predicted
time-series for the topic {"iphone", "release"} in experimental
networks #1 and #2. The x axis represents time in days
and the y axis represents the activity level, measured in
volume of tweets. The gray dashed curve corresponds to the
real volume measured in the data, while the black curve
corresponds to the volume predicted by T-BaSIC. After varying
the value of s in the experiments, the optimal prediction is
obtained using s = 8 for network #1 and s = 5 for network
#2. In both cases we observe a particular wave pattern,
with different phase and amplitude. One can see that the
model accurately captures these variations but slightly
underestimates the volume. We examine the prediction made by
T-BaSIC in more detail by analysing the population of
users involved throughout the diffusion process. We classify
them into two groups, based on definitions by Daley and
Kendall [5]: (i) transmitters, i.e. users who received the
information and then transmitted it to others, and (ii)
stiflers, i.e. users who received the information but never
transmitted it. Figure 7 shows the evolution of the density
of stiflers for the prediction made on network #1. The
correlation between the shape of the volume and the density
of stiflers is clearly visible. Five days after the appearance
of the information, the density of stiflers is continuously
rising. This is due in part to the low connectivity of these
users: they are reached by the information later in the
process and have a lower potential of diffusion. This shows
the relevance of the graph-based approach.

In order to allow comparison, we now show the results
obtained for another topic, {"google", "buy"}, in the same two
networks (see Figure 8). After varying the value of s in the
experiments, the optimal prediction is obtained using s = 11
for network #1 and s = 14 for network #2. Again, we can see a
wave pattern, but this time it is weaker, which reveals that
people are less interested in this topic. This highlights the
importance of taking the topical dimension into account in
the computation of the diffusion probabilities. Also, once
again, the predicted volume is lower than the real volume.

We obtained similar results with all the selected topics
and predictions on the four experimental social networks,
with the size of S varying between 5 and 20. This means that,
starting from a few informed people, we are able to predict
when peaks of attention will occur and in which proportion.
This result is consistent with the "two-step flow of
communication" theory introduced by Katz and Lazarsfeld
[13], which hypothesizes that information flows from media to
a few "opinion leaders" who then spread it to the "mass" via
social networks. However, we found that our modeling always
underestimates the global volume. This can be explained by
the fact that we consider that individuals, apart from those
constituting the starting set S, get information exclusively
via their social network, as explicitly stated in the problem
formulation and referred to as the "closed environment". In
reality, information is injected into Twitter throughout the
diffusion process and not only at the beginning.

Figure 6: Comparison of real and predicted time-series
for the topic {"iphone", "release"} in experimental
networks #1 and #2.

Figure 7: Density of transmitters and stiflers. The
curve represents the cumulative number of users that
adopted the topic.

Figure 8: Comparison of real and predicted time-series
for the topic {"google", "buy"} in experimental
networks #1 and #2.

5.2 Quantitative results

We now quantitatively assess the efficiency of our modeling
by computing the reduction in prediction error over the
1-time lag predictor [22], according to two aspects: (i) volume
and (ii) dynamics. The 1-time lag predictor, introduced by
Yang and Leskovec, is a simple predictor that gives $P_{c_i}(t)$
such that $P_{c_i}(t) = R_{c_i}(t - 1)$. We compute the relative error
on volume estimation with the formula below:



We then compute the relative error on dynamics according
to this formula:



where d is the derivative for each point of the time-series,
which we compute in this way:

$$d(R_{c_i}(t)) = R_{c_i}(t + 1) - R_{c_i}(t - 1)$$
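The baseline predictor and the derivative used above are simple enough to
sketch in a few lines. The function names and the toy series are
assumptions of this example, and the derivative is implemented exactly as
written in the text, for the points that have both a predecessor and a
successor.

```python
def one_time_lag_prediction(real):
    """Baseline P(t) = R(t - 1): predictions for t = 1 .. len(real) - 1."""
    return real[:-1]


def derivative(series):
    """d(R(t)) = R(t + 1) - R(t - 1) for every interior point t."""
    return [series[t + 1] - series[t - 1] for t in range(1, len(series) - 1)]


# Toy daily volumes, made up for illustration.
real = [2, 5, 9, 4, 1]
lagged = one_time_lag_prediction(real)  # compared against real[1:]
slopes = derivative(real)
```

For this toy series the baseline predicts `[2, 5, 9, 4]` for days 1 to 4,
and the derivative at the three interior points is `[7, -1, -8]`.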



We report the reduction in prediction error on volume
and dynamics of our approach over the 1-time lag predictor
in Table 6 for four particular shapes of volume over time.
These shapes correspond to the examples we just detailed.
Overall, the results are satisfactory, as reflected by the
overall gain measure.

|                       | Shape 1 | Shape 2 | Shape 3 | Shape 4 | Average |
|-----------------------|---------|---------|---------|---------|---------|
| reduction on dynamics | 25.19%  | 39.23%  | 29.21%  | 3.22%   | 24.21%  |
| reduction on volume   | 42.89%  | 47.70%  | 34.49%  | 40.07%  | 41.29%  |
| overall gain          | 34.04%  | 43.46%  | 31.85%  | 21.65%  | 32.75%  |

Table 6: Reduction in prediction error on volume and dynamics over the 1-time lag predictor for four shapes
of volume over time.

6. CONCLUSION

In this paper, we propose the T-BaSIC model and its
application to data from Twitter. To achieve this,
we determined with a preliminary study a set of pertinent
features that ensure a generic model for representing
information diffusion, whose parameters are estimated from the
considered data themselves. This model makes it possible to
predict information diffusion while taking into account
social, semantic, and temporal dimensions. More precisely,
this model is derived from the AsIC [20] theoretical model
and relies on time-dependent parameters. We infer the
diffusion probabilities between nodes of the network with a
machine learning technique, namely Bayesian logistic
regression. We performed a set of experiments for different
topics. The experimental results mainly show that the model
predicts the dynamics of the diffusion well (our initial
objective). The volume is slightly underestimated due to our
initial assumption of a "closed environment". This ignores
the impact of external information sources on the networks,
which may explain the gap between the predicted volume and
the observed volume. Still, our results support the "two-step
theory", which hypothesizes that only a few "opinion leaders"
relay information from media to the "mass population" via
social networks, and show that it also applies to online
social networks.

The perspectives opened by this work are numerous. Among
them, we identified four main issues we want to investigate.
First, since the T-BaSIC model parameters are not
fixed in advance, it should allow us to take into account the
evolution, over time, of the environment for the estimation
of diffusion probabilities. Thus it could help us to consider
the phenomenon of "complex contagion" as introduced by
[19] (i.e. repeated exposures to a topic have a positive
impact on the probability that the user adopts it). Concerning
the genericity of the model, another issue consists in
applying our approach to social data from other platforms
to study both the common points and the specificities of the
information diffusion process according to the platform. A
third one consists in enriching the semantic dimension. Text
mining techniques could be useful for this challenge, keeping
in mind that each social platform has specificities that must
be taken into account. Finally, T-BaSIC is based on the AsIC
theoretical model, which means it is built from the point of
view of the diffuser node. It could be interesting to envision
the dual approach, i.e. a T-BAsLT model based on the AsLT [20]
approach, focusing on the point of view of the receiving node.
Comparing the two approaches may bring interesting insights.